Phenogrouping heart failure with preserved or mildly reduced ejection fraction using electronic health record data

Background
Heart failure (HF) with preserved or mildly reduced ejection fraction includes a heterogeneous group of patients. Reclassification into distinct phenogroups to enable targeted interventions is a priority. This study aimed to identify distinct phenogroups from electronic health record data and to compare their characteristics and outcomes.

Methods
A total of 2,187 patients admitted to five UK hospitals with a diagnosis of HF and a left ventricular ejection fraction ≥ 40% were identified from the NIHR Health Informatics Collaborative database. Partition-based, model-based, and density-based machine learning clustering techniques were applied. Cox Proportional Hazards and Fine-Gray competing risks models were used to compare outcomes (all-cause mortality and hospitalisation for HF) across phenogroups.

Results
Three phenogroups were identified: (1) younger, predominantly female patients with a high prevalence of cardiometabolic and coronary disease; (2) more frail patients, with higher rates of lung disease and atrial fibrillation; (3) patients characterised by systemic inflammation and high rates of diabetes and renal dysfunction. Survival profiles were distinct, with an increasing risk of all-cause mortality from phenogroups 1 to 3 (p < 0.001). Phenogroup membership significantly improved survival prediction compared to conventional factors. Phenogroups were not predictive of hospitalisation for HF.

Conclusions
Applying unsupervised machine learning to routinely collected electronic health record data identified phenogroups with distinct clinical characteristics and unique survival profiles.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12872-024-03987-9.

performed using many different assays, the results were standardised by expressing each result as the ratio of the troponin value to the upper limit of normal (ULN) of the corresponding assay. A total of 424 patients (19%) had at least one missing value. Missing data (summarised in Supplemental Figure 1) were imputed using the missForest package in R, an iterative imputation method based on a random forest that constitutes a multiple imputation scheme by averaging over multiple regression trees (1). The out-of-bag error was 0.07. Following imputation, categorical variables (all binary) were transformed to numerical, and all variables were scaled and standardised to a mean of 0 and a standard deviation of 1.
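The two transformations described above (troponin expressed as a multiple of the assay ULN, then scaling to mean 0 and standard deviation 1) can be illustrated with a minimal Python sketch. It is not the study's R code: the function names and the example values are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed because that is the default of R's `scale` function.

```python
from statistics import mean, stdev


def troponin_ratio(value, uln):
    """Express a troponin result as a multiple of its assay's upper limit of normal."""
    return value / uln


def standardise(values):
    """Scale values to mean 0 and (sample) standard deviation 1."""
    mu = mean(values)
    sd = stdev(values)  # n - 1 denominator (assumed, as in R's scale())
    return [(v - mu) / sd for v in values]


# Hypothetical results from two assays with different ULNs (illustrative only).
ratios = [troponin_ratio(28.0, 14.0), troponin_ratio(20.0, 40.0)]
scaled = standardise([1.0, 2.0, 3.0])
```

After this step, results from different assays sit on a common scale, and every variable contributes comparably to Euclidean distances in the clustering.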

Clustering methodology
Three clustering methods were applied:

Density-based clustering
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a density-based clustering algorithm, was applied to the data first. DBSCAN has several advantages. In contrast to many other clustering methods, it does not require the user to specify the number of clusters. Furthermore, DBSCAN is better suited to finding arbitrarily shaped clusters and to detecting outliers in data (2). The algorithm requires two input parameters: the epsilon value, which defines the radius of the neighbourhood around a data point 'x'; and the minimum points value, which is the minimum number of other data points, or neighbours, within the epsilon radius.
The dbscan function in the fpc R package was applied. The minimum points value was calculated by multiplying the dimensionality of the data by 2, giving a minimum points value of 84. The optimal epsilon was then selected by plotting a k-distance plot (where k corresponds to the minimum points value) and choosing the value at the 'elbow' point (3).
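The parameter heuristics above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the fpc implementation: `min_points` and `k_distances` are hypothetical helper names, Euclidean distance is assumed, and each point is excluded from its own neighbour list. A dimensionality of 42 is implied by the reported minimum points value of 84.

```python
import math


def min_points(n_features):
    """Heuristic from the text: twice the dimensionality of the data."""
    return 2 * n_features


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour.

    Plotting these values in ascending order gives the k-distance plot;
    the 'elbow' of that curve is a candidate epsilon.
    """
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Toy data: three close points and one distant point.
toy = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (10.0, 10.0)]
kd = k_distances(toy, k=1)
```

In the toy data the last value of `kd` is much larger than the others, which is the kind of jump the elbow criterion looks for.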

Model-based clustering
Gaussian mixture modelling, as implemented by Shah et al. using the mclust package in R, is a form of model-based clustering that estimates parameters using an expectation-maximisation algorithm (4,5). A variety of covariance structures can be explored, and the optimal number of clusters can be determined by maximising the Bayesian Information Criterion (BIC).
The BIC penalises model complexity and allows selection of models with better overall fit. Between 1 and 9 clusters were explored.
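The selection rule above can be sketched in Python. The sketch assumes the sign convention used by mclust, in which BIC = 2 log L - m log n (m free parameters, n observations) and larger values are better; the log-likelihoods and parameter counts below are made-up numbers for illustration, not results from the study.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' form: 2*logL - m*log(n)."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


def select_by_bic(candidates, n_obs):
    """Return the number of clusters whose model has the highest BIC.

    candidates maps number of clusters -> (log-likelihood, number of parameters).
    """
    return max(candidates, key=lambda g: bic(*candidates[g], n_obs))


# Hypothetical fits (illustrative values only).
fits = {1: (-1500.0, 5), 2: (-1400.0, 11), 3: (-1395.0, 17)}
best = select_by_bic(fits, n_obs=500)
```

Here the three-cluster model has the highest likelihood, but its extra parameters are penalised enough that the two-cluster model is preferred.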

Partition-based clustering
K-means clustering, implemented here using the stats package in R, is a commonly used partition-based clustering approach that divides the data into k clusters (6). The number of clusters (k) is pre-defined by the user, and the algorithm starts with a random selection of k observations, which act as the initial cluster centroids. The remaining observations are then assigned to the closest centroid, calculated using the Euclidean distance. Following this, the centroid of each cluster is updated, and the observations are reassigned using the updated centroid values. Finally, the whole process is repeated iteratively to minimise within-cluster variation (defined as the sum of squared Euclidean distances) until convergence.
Several methods, such as the average silhouette or gap statistic, exist to select the optimal number of clusters. In this study, k was determined using the NbClust package in R. NbClust selects the optimal number of clusters as the number chosen by the majority of 30 different indices (Supplemental Figure 2) (7). The algorithm was restricted to selecting between 3 and 8 clusters. The optimal number chosen by most indices was 3 clusters.
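The iterative procedure described above can be sketched in plain Python. This is a minimal illustration, not the R `stats::kmeans` implementation (whose default algorithm differs in detail): the function names, the fixed seed, and the toy data are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random observations as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no reassignment, so the algorithm has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def within_cluster_ss(points, labels, centroids):
    """Total within-cluster sum of squared Euclidean distances."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Toy data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(data, k=2)
wss = within_cluster_ss(data, labels, centroids)
```

Because the result can depend on the random starting centroids, practical use typically repeats the procedure from several random starts and keeps the solution with the lowest within-cluster sum of squares.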

Cluster stability
To investigate the validity of the clusters, cluster stability was assessed by calculating the mean Jaccard coefficient for each cluster across bootstrapped replicates (8). The Jaccard coefficient is a similarity measure that compares the clusters obtained when the algorithm is repeated over bootstrapped samples. Values range from 0 to 1, with a value closer to 1 indicating a greater degree of similarity and stability.
Cluster stability was evaluated using the clusterboot function in the fpc R package. Given that the DBSCAN algorithm was unable to separate the data into more than a single cluster, its cluster stability was not assessed. The model-based clustering and k-means clustering approaches were repeated 100 and 1000 times respectively (1000 replicates were deemed too computationally expensive for the model-based clustering approach).
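The stability measure described above can be sketched in Python. The sketch assumes that, for each bootstrap replicate, an original cluster is compared with its most similar cluster in the re-clustered sample and that these best-match values are averaged across replicates; the function names and the example sets of patient identifiers are hypothetical.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_match_jaccard(original_cluster, resampled_clusters):
    """Similarity of an original cluster to its closest match in one replicate."""
    return max(jaccard(original_cluster, c) for c in resampled_clusters)


def mean_stability(original_cluster, replicates):
    """Mean best-match Jaccard coefficient across bootstrap replicates."""
    return mean(best_match_jaccard(original_cluster, r) for r in replicates)


# Hypothetical patient identifiers (illustrative only).
original = {1, 2, 3}
replicates = [
    [{1, 2, 3}, {4, 5}],  # the cluster is recovered exactly in this replicate
    [{2, 3, 4}, {1, 5}],  # the cluster is partially recovered in this replicate
]
stability = mean_stability(original, replicates)
```

In a full bootstrap, each comparison would be restricted to the observations that appear in the resampled data set; that step is omitted here for brevity.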

Supplemental Table 1: National Institute for Cardiovascular Outcomes Research (NICOR) definition of heart failure (implemented in the National Heart Failure Audit in England and Wales). ICD-10 indicates International Classification of Diseases, 10th Revision.